There are many Python packages for visualization. We'll focus on the plotting capabilities available through pandas and, to a lesser extent, matplotlib. Both libraries are well documented. The case study analyzes the flow of bicycles out of stations in the Pronto trip data.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
# The following ensures that the plots are in the notebook
%matplotlib inline
# We'll also use capabilities in numpy
import numpy as np
In [2]:
df = pd.read_csv("2015_trip_data.csv")
df.head()
Out[2]:
Now let's consider the popularity of the stations.
In [4]:
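# Series.value_counts() replaces the top-level pd.value_counts,
# which is deprecated in recent versions of pandas
from_counts = df.from_station_id.value_counts()
to_counts = df.to_station_id.value_counts()
What kind of objects does value_counts return? Are they plottable?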
In [6]:
from_counts
Out[6]:
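To make the answer concrete, here is a minimal sketch on toy data. The station IDs below are made up for illustration and are not taken from the Pronto file. It shows that value_counts returns a pandas Series whose index holds the unique values and whose values hold the counts, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical station IDs, not taken from the Pronto data
trips = pd.Series(["A-01", "B-02", "A-01", "C-03", "A-01", "B-02"])
counts = trips.value_counts()

# The result is a Series: unique values in the index, counts in the values
print(type(counts).__name__)
print(counts.index.tolist())
print(counts.tolist())
```

Because the result is a Series, it has the same .plot accessor used in the cells that follow.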
Our initial task is comparison - which stations are most popular. A bar plot seems appropriate. We can plot this using pandas.
In [7]:
from_counts.plot.bar()
Out[7]:
In [8]:
to_counts.plot.bar()
Out[8]:
We want to know whether there is a general movement of bikes from one station to another. That is, are the from and to counts out of balance? This is a comparison task. One approach is to combine the two bar plots in the same figure.
In [9]:
# Use a 3-row grid and leave the middle row empty
# to provide space between the two plots
plt.subplot(3,1,1)
from_counts.plot.bar()
plt.subplot(3,1,3)
to_counts.plot.bar()
Out[9]:
But this is deceptive, since the two plots have different x-axes: each Series is sorted by its own counts, so the stations appear in a different order in each plot. So, first we'll make sure that the counts are ordered consistently. from_counts and to_counts are indexed by the station ID. That is, to_counts['WF-01'] should have the value 7212.
In [10]:
# Script to put to_counts in the same order as from_counts
to_counts_list = []
for station in from_counts.index:
    to_counts_list.append(to_counts[station])
An even better way to do this is to use a Python "comprehension", which condenses a short loop like the one above into a single expression. A comprehension can build a list or a dict.
In [11]:
to_counts_list = [to_counts[station] for station in from_counts.index]
# ordered_to_counts = pd.Series(to_counts_list, index=from_counts.index)
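As a small illustration of both forms, the sketch below uses a plain dictionary of made-up counts (the names to_counts_demo and station_order are hypothetical) in place of the real Series.

```python
# Hypothetical counts keyed by station ID (values are illustrative only)
to_counts_demo = {"A-01": 5, "B-02": 3, "C-03": 8}
station_order = ["C-03", "A-01", "B-02"]

# List comprehension: look up each station in the desired order
ordered = [to_counts_demo[station] for station in station_order]

# Dict comprehension: build a new mapping, keeping only counts above 4
busy = {station: n for station, n in to_counts_demo.items() if n > 4}

print(ordered)  # [8, 5, 3]
print(busy)     # {'A-01': 5, 'C-03': 8}
```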
Now that we have from_counts and to_counts_list in the same station order, let's bundle them into a new dataframe. A convenient way to build a dataframe is from a dictionary.
Dictionaries are written as key-value pairs within curly braces ({ }) and provide a kind of associative memory: each key maps to a value.
In [16]:
a_dict = {'a': 1}
a_dict['a']
a_dict['a'] = 2
a_dict['b'] = [1,2,3]
In [17]:
a_dict
Out[17]:
In [19]:
b_dict = {'from': from_counts.values, 'to': to_counts_list}
In [21]:
df_counts = pd.DataFrame({'from': from_counts.values, 'to': to_counts_list}, index=from_counts.index)
df_counts.head()
Out[21]:
In [22]:
df_counts.index
Out[22]:
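As an aside, the manual reordering can be avoided. When the dictionary passed to pd.DataFrame holds Series rather than lists or arrays, pandas aligns the rows by index label instead of by position. The sketch below shows this on made-up counts (from_demo and to_demo are hypothetical stand-ins for from_counts and to_counts).

```python
import pandas as pd

# Hypothetical counts with the same stations listed in different orders
from_demo = pd.Series({"A-01": 10, "B-02": 7, "C-03": 4})
to_demo = pd.Series({"C-03": 6, "A-01": 8, "B-02": 9})

# Passing Series (not raw lists) lets pandas line rows up by station ID
counts_demo = pd.DataFrame({"from": from_demo, "to": to_demo})
print(counts_demo)
```

Each row pairs the from and to counts for the same station, even though the two inputs were ordered differently.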
Let's plot the two columns of the df_counts dataframe and look for differences.
In [13]:
plt.subplot(3,1,1)
df_counts['from'].plot.bar()
plt.subplot(3,1,3)
df_counts['to'].plot.bar()
Out[13]:
It is awkward to find the differences this way, since we must move our eyes between the two plots. A better approach is to look at a single variable: outflow, which is the "from" counts minus the "to" counts. We'll define a new dataframe.
In [14]:
# Outflow is departures ("from") minus arrivals ("to")
df_outflow = pd.DataFrame({'outflow': df_counts['from'] - df_counts['to']}, index=df_counts.index)
df_outflow.plot.bar(legend=False)
Out[14]:
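One optional refinement, sketched here on made-up numbers, is to sort the outflow before plotting so that stations losing and gaining bikes group at opposite ends of the chart. The name counts_demo and its values are hypothetical.

```python
import pandas as pd

# Hypothetical from/to counts for three stations
counts_demo = pd.DataFrame(
    {"from": [10, 7, 4], "to": [8, 9, 7]},
    index=["A-01", "B-02", "C-03"],
)

# Outflow as defined in the text: "from" counts minus "to" counts,
# sorted from the largest net inflow to the largest net outflow
outflow_demo = (counts_demo["from"] - counts_demo["to"]).sort_values()
print(outflow_demo)
```

Calling outflow_demo.plot.bar() on the sorted Series would then draw the bars in that order.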
We can make this more readable by looking only at stations with large outflows, either positive or negative.
In [23]:
min_flow = 500
sel = abs(df_outflow.outflow) > min_flow
df_outflow_small = df_outflow[sel]
df_outflow_small.plot.bar(legend=False)
Out[23]:
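The selection above relies on boolean indexing: comparing a column against a threshold yields a Series of True/False values, and indexing the dataframe with it keeps only the True rows. A minimal sketch on made-up outflows (the values and the name outflow_demo are hypothetical):

```python
import pandas as pd

# Hypothetical outflows for four stations
outflow_demo = pd.DataFrame(
    {"outflow": [620, -40, -710, 130]},
    index=["A-01", "B-02", "C-03", "D-04"],
)
min_flow = 500

# True where the magnitude of the outflow exceeds the threshold
mask = outflow_demo["outflow"].abs() > min_flow
large = outflow_demo[mask]

print(mask.tolist())
print(large)
```

Only the stations whose outflow exceeds the threshold in either direction remain in large.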